43 research outputs found

Text Mining Infrastructure in R

Authors: David Meyer, Ingo Feinerer, Kurt Hornik

During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

Research Papers in Economics

Lossless Selection Views under Constraints

Authors: Ingo Feinerer, Enrico Franconi, Paolo Guagliardo
Publication date: 01/01/2014

The problem of updating a database through a set of views consists in propagating updates of the views to the base relations over which the view relations are defined, so that the changes to the database reflect exactly those to the views. This is a classical problem in database research, known as the view update problem ...

CiteSeerX

Edinburgh Research Explorer

A comparison of tools for teaching formal software verification

Authors: D Gries, DA Patterson, E Dijkstra, E Stiller, EM Clarke, Gernot Salzer, Ingo Feinerer, MRA Huth, PJ Denning, S Owre, W Ahrendt
Publication venue: Springer Science and Business Media LLC

Crossref

Text Mining of Supreme Administrative Court Jurisdictions

Authors: Ingo Feinerer, Kurt Hornik
Publication venue: Department of Statistics and Mathematics, WU Vienna University of Economics and Business
Publication date: 01/01/2007

Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the raw amount of available jurisdiction. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. We analyze the law corpora using R with the new text mining package tm. Applications include clustering the jurisdiction documents into groups modeling tax classes (like income or value-added tax) and identifying jurisdiction properties. The findings are compared to results obtained by law experts.

Series: Research Report Series / Department of Statistics and Mathematics

Elektronische Publikationen der Wirtschaftsuniversität Wien

Text Clustering with String Kernels in R

Authors: Ingo Feinerer, Alexandros Karatzoglou
Publication venue: Department of Statistics and Mathematics, WU Vienna University of Economics and Business
Publication date: 01/01/2006

We present a package which provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package and the kernlab R package we explore the use of kernel methods for clustering (e.g., kernel k-means and spectral clustering) on a set of text documents, using string kernels. We compare these methods to a more traditional clustering technique, k-means on a bag-of-words representation of the text, and evaluate the viability of kernel-based methods as a text clustering technique. (author's abstract)

Series: Research Report Series / Department of Statistics and Mathematics

Elektronische Publikationen der Wirtschaftsuniversität Wien

Distributed Text Mining in R

Authors: Ingo Feinerer, Kurt Hornik, Stefan Theußl
Publication venue: WU Vienna University of Economics and Business
Publication date: 16/03/2011

R has recently gained explicit text mining support with the "tm" package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed leads to an increasing computational workload. Fortunately, adequate parallel programming models like MapReduce and the corresponding open source implementation called Hadoop allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc" offering a seamless integration between "tm" and Hadoop. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.

Series: Research Report Series / Department of Statistics and Mathematics
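As a rough illustration of the MapReduce programming model named in this abstract, the following minimal sketch counts term frequencies over a small corpus in a single Python process. It is not the interface of "tm.plugin.dc" or of Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `term_frequencies`) and the toy corpus are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, 1) pair per whitespace-separated token."""
    return [(token.lower(), 1) for token in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Reduce step: sum the partial counts collected for one term."""
    return term, sum(counts)


def term_frequencies(corpus):
    """Run map, shuffle and reduce in-process over a dict of doc_id -> text."""
    pairs = []
    for doc_id, text in corpus.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(term, counts) for term, counts in shuffle(pairs).items())


# Hypothetical two-document corpus.
corpus = {"d1": "text mining in R", "d2": "Text mining with Hadoop"}
frequencies = term_frequencies(corpus)
```

In a distributed setting the map calls would run on separate nodes, each over its own chunk of the corpus, so no single machine needs to hold all documents in memory.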

Elektronische Publikationen der Wirtschaftsuniversität Wien

A text mining framework in R and its applications

Author: Ingo Feinerer
Publication date: 01/08/2008

Text mining has become an established discipline both in research and in business intelligence. However, many existing text mining toolkits lack easy extensibility and provide only poor support for interacting with statistical computing environments. Therefore we propose a text mining framework for the statistical computing environment R which provides intelligent methods for corpora handling, metadata management, preprocessing, operations on documents, and data export. We present how well-established text mining techniques can be applied in our framework and show how common text mining tasks can be performed utilizing our infrastructure. The second part of this thesis is dedicated to a set of realistic applications using our framework. The first application deals with the implementation of a sophisticated mailing list analysis, whereas the second example identifies the potential of text mining methods for business-to-consumer electronic commerce. The third application shows the benefits of text mining for law documents. Finally we present an application which deals with authorship attribution on the famous Wizard of Oz book series. (author's abstract)

Elektronische Publikationen der Wirtschaftsuniversität Wien

Text Mining Infrastructure in R

Authors: Ingo Feinerer, Kurt Hornik, David Meyer
Publication venue: Foundation for Open Access Statistics
Publication date: 01/01/2008

CiteSeerX

Crossref

Directory of Open Access Journals

Elektronische Publikationen der Wirtschaftsuniversität Wien

Journal of Statistical Software

Spherical k-Means Clustering

Authors: Christian Buchta, Ingo Feinerer, Kurt Hornik, Martin Kober
Publication venue: Informa UK Limited
Publication date: 01/09/2012

Clustering text documents is a fundamental task in modern data analysis, requiring approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). Performance of these solvers is investigated by means of a large scale benchmark experiment. (authors' abstract)
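The fixed-point idea mentioned in this abstract can be sketched in a few lines of plain Python, under simplifying assumptions: documents are given as dense term-weight vectors, prototypes are initialized from randomly sampled documents, and a fixed number of iterations is run. This is an illustrative sketch, not the skmeans package's implementation; all function names and the toy data are hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are copied unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(u, v):
    """Dot product; equals the cosine similarity when both inputs have unit length."""
    return sum(a * b for a, b in zip(u, v))


def assign(data, prototypes):
    """Give each unit-length row the index of its most similar prototype."""
    return [
        max(range(len(prototypes)), key=lambda j: cosine_similarity(x, prototypes[j]))
        for x in data
    ]


def spherical_kmeans(rows, k, iterations=20, seed=0):
    """Alternate cosine-based assignment and renormalized cluster sums."""
    rng = random.Random(seed)
    data = [normalize(r) for r in rows]
    prototypes = [list(p) for p in rng.sample(data, k)]  # assumed initialization
    for _ in range(iterations):
        labels = assign(data, prototypes)
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old prototype if a cluster becomes empty
                prototypes[j] = normalize([sum(col) for col in zip(*members)])
    return assign(data, prototypes), prototypes


# Hypothetical term-weight rows: two documents per topic.
rows = [[3, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 4], [0, 0, 2, 5]]
labels, prototypes = spherical_kmeans(rows, k=2)
```

Because every row and prototype is kept at unit length, maximizing the dot product is the same as minimizing the cosine dissimilarity, which is why the update step only needs to sum the members of a cluster and renormalize.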

Directory of Open Access Journals

Elektronische Publikationen der Wirtschaftsuniversität Wien

Journal of Statistical Software

An Approach to Incorporate Texts into a Social Network Analysis of Communication Graphs

Authors: Angela Bohn, Ingo Feinerer, Kurt Hornik, Patrick Mair
Publication venue: Department of Statistics and Mathematics, WU Vienna University of Economics and Business
Publication date: 01/01/2009

Social network analysis (SNA) provides tools to examine relationships between people. Text mining (TM) allows capturing the text they produce, for example in Web 2.0 applications; however, it neglects their social structure. This paper applies an approach that combines the two methods, named "content-based SNA" (CB-SNA). Using the R mailing lists, R-help and R-devel, we show how this combination can be used to describe people's interests and to find out if authors who have similar interests actually communicate. We find that the expected positive relationship between sharing interests and communicating gets stronger as the centrality scores of authors in the communication networks increase.

Series: Research Report Series / Department of Statistics and Mathematics

Elektronische Publikationen der Wirtschaftsuniversität Wien